The 10 longest words with frequency>1, ordered by length
Length | Frequency | Word |
---|---|---|
78 | 2 | http://www.loksatta.com/lokrang-news/article-on-marathi-literature-77463/lite/ |
55 | 66 | http://www.censusindia.gov.in/2011census/dchb/DCHB.html |
25 | 2 | अभाअण्णाद्रमुक-मरूमलार्चि |
24 | 2 | अभिव्यक्तिस्वातंत्र्याचा |
24 | 8 | ट्यूबवेलच्या/बोअरवेलच्या |
23 | 4 | स्वातंत्र्यप्राप्तीनंतर |
23 | 3 | राष्ट्रीय-आंतरराष्ट्रीय |
23 | 3 | स्वातंत्र्यसंग्रामाच्या |
23 | 2 | मेक्लेनबुर्ग-फोरपोमेर्न |
22 | 2 | नोर्डर्हाईन-वेस्टफालन |
The table shows the longest words in the corpus that occur at least twice. Requiring a minimum frequency of 2 reduces the chance that misprinted words appear in the list.
Apart from the two URLs at the top of the list, no word is much longer than the next one: the remaining lengths fall gradually from 25 to 22 characters. This, again, argues for largely correct preprocessing.
With correct preprocessing, the longest words are genuine words. In many cases, they belong to specific topics that tend to produce long words.
With poor preprocessing, non-word strings appear in the list.
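As a rough illustration of how non-word strings could be spotted in such a list, the sketch below flags entries that look like URLs. The pattern and the function name `flag_url_like` are assumptions made for this example; they are not part of the corpus preprocessing described here, and a real check would likely need further rules.

```python
import re

# Hypothetical heuristic: a string starting with a URL scheme or "www." is not a word.
URL_LIKE = re.compile(r"^(?:https?://|www\.)", re.IGNORECASE)


def flag_url_like(words):
    """Return the entries of `words` that look like URLs rather than words."""
    return [w for w in words if URL_LIKE.match(w)]


candidates = [
    "http://www.censusindia.gov.in/2011census/dchb/DCHB.html",
    "राष्ट्रीय-आंतरराष्ट्रीय",
]
flagged = flag_url_like(candidates)
```

Here `flagged` contains only the URL; the hyphenated Marathi compound is kept as a word.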
The length of the longest words clearly depends on language and corpus size.
select char_length(word) as le, freq, word from words where freq>1 order by le desc limit 10;
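The query above can be tried on a small in-memory database. The following is a minimal sketch, assuming a `words(word, freq)` table as in the query; it uses SQLite, where `length()` takes the place of `char_length()` and counts characters in text values. The sample rows and the name `longest_words` are made up for illustration.

```python
import sqlite3


def longest_words(conn, min_freq=2, limit=10):
    """Return (length, freq, word) rows for the longest words seen at least min_freq times."""
    query = (
        "SELECT length(word) AS le, freq, word FROM words "
        "WHERE freq >= ? ORDER BY le DESC LIMIT ?"
    )
    return conn.execute(query, (min_freq, limit)).fetchall()


# Build a tiny stand-in for the corpus word list (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, freq INTEGER)")
sample = [
    ("http://www.censusindia.gov.in/2011census/dchb/DCHB.html", 66),
    ("राष्ट्रीय-आंतरराष्ट्रीय", 3),
    ("a-very-long-string-that-occurs-only-once-and-is-therefore-dropped", 1),
    ("शब्द", 12),
]
conn.executemany("INSERT INTO words (word, freq) VALUES (?, ?)", sample)

rows = longest_words(conn)
for length, freq, word in rows:
    print(length, freq, word)
```

The string with frequency 1 is filtered out even though it is the longest entry, which is the effect of the `freq>1` condition in the original query.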
How does the length of the longest words increase with corpus size?
3.2.3.1 Longest Words in top-1000 by length